Provable Robustness for Streaming Models with a Sliding Window
The literature on provable robustness in machine learning has primarily
focused on static prediction problems, such as image classification, in which
input samples are assumed to be independent and model performance is measured
as an expectation over the input distribution. Robustness certificates are
derived for individual input instances with the assumption that the model is
evaluated on each instance separately. However, in many deep learning
applications such as online content recommendation and stock market analysis,
models use historical data to make predictions. Robustness certificates based
on the assumption of independent input samples are not directly applicable in
such scenarios. In this work, we focus on the provable robustness of machine
learning models in the context of data streams, where inputs are presented as a
sequence of potentially correlated items. We derive robustness certificates for
models that use a fixed-size sliding window over the input stream. Our
guarantees hold for the average model performance across the entire stream and
are independent of stream size, making them suitable for large data streams. We
perform experiments on speech detection and human activity recognition tasks
and show that our certificates can produce meaningful performance guarantees
against adversarial perturbations.
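Conceptually, the sliding-window setting can be sketched as follows; the `model`, window size, and stream below are illustrative assumptions, not the paper's certified procedure.

```python
from collections import deque

def evaluate_stream(model, stream, window_size):
    """Evaluate a model that predicts from a fixed-size sliding window.

    `model` maps a tuple of the last `window_size` items to a prediction;
    average performance is then taken over all full windows in the stream.
    (Illustrative sketch only, not the paper's certification method.)
    """
    window = deque(maxlen=window_size)
    predictions = []
    for item in stream:
        window.append(item)
        if len(window) == window_size:
            predictions.append(model(tuple(window)))
    return predictions
```

Note that a single perturbed stream item appears in up to `window_size` consecutive windows, which is why per-instance certificates derived under the independence assumption do not transfer directly to this setting.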
Exploring Geometry of Blind Spots in Vision Models
Despite the remarkable success of deep neural networks in a myriad of
settings, several works have demonstrated their overwhelming sensitivity to
near-imperceptible perturbations, known as adversarial attacks. On the other
hand, prior works have also observed that deep networks can be under-sensitive,
wherein large-magnitude perturbations in input space do not induce appreciable
changes to network activations. In this work, we study in detail the phenomenon
of under-sensitivity in vision models such as CNNs and Transformers, and
present techniques to study the geometry and extent of "equi-confidence" level
sets of such networks. We propose a Level Set Traversal algorithm that
iteratively explores regions of high confidence with respect to the input space
using orthogonal components of the local gradients. Given a source image, we
use this algorithm to identify inputs that lie in the same equi-confidence
level set as the source image despite being perceptually similar to arbitrary
images from other classes. We further observe that the source image is linearly
connected by a high-confidence path to these inputs, uncovering a star-like
structure for level sets of deep networks. Furthermore, we attempt to identify
and estimate the extent of these connected higher-dimensional regions over
which the model maintains a high degree of confidence. The code for this
project is publicly available at
https://github.com/SriramB-98/blindspots-neurips-sub
Comment: 25 pages, 20 figures, Accepted at NeurIPS 2023 (spotlight)
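The core projection step of such a level-set traversal can be sketched as follows; the gradient oracle `grad`, the step size, and the stopping criteria are assumptions for illustration, and the authors' full algorithm additionally includes confidence-correction steps.

```python
import numpy as np

def level_set_traversal(grad, x_source, x_target, step_size=0.01, n_steps=500):
    """Walk from x_source toward x_target while staying (to first order)
    on the same confidence level set: at each step, move only along the
    component of the target direction orthogonal to the local gradient
    of the model's confidence at the current point.

    `grad(x)` returns the gradient of the source-class confidence at x.
    Minimal sketch of the orthogonal-projection idea, not the paper's
    exact implementation.
    """
    x = np.array(x_source, dtype=float)
    for _ in range(n_steps):
        g = grad(x)
        d = x_target - x
        # Subtract the component of d along g, so confidence is
        # unchanged to first order as we step.
        d_perp = d - (np.vdot(d, g) / (np.vdot(g, g) + 1e-12)) * g
        if np.linalg.norm(d_perp) < 1e-6:
            break
        x = x + step_size * d_perp
    return x
```

Because each step is orthogonal to the local gradient, the model's confidence stays approximately constant along the path, which is what produces the high-confidence linear connections described in the abstract.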
Can AI-Generated Text be Reliably Detected?
In this paper, both empirically and theoretically, we show that several
AI-text detectors are not reliable in practical scenarios. Empirically, we show
that paraphrasing attacks, where a light paraphraser is applied on top of a
large language model (LLM), can break a whole range of detectors, including
ones using watermarking schemes as well as neural network-based detectors and
zero-shot classifiers. Our experiments demonstrate that retrieval-based
detectors, designed to evade paraphrasing attacks, are still vulnerable to
recursive paraphrasing. We then provide a theoretical impossibility result
indicating that as language models become more sophisticated and better at
emulating human text, the performance of even the best-possible detector
decreases. For a sufficiently advanced language model seeking to imitate human
text, even the best-possible detector may only perform marginally better than a
random classifier. Our result is general enough to capture specific scenarios
such as particular writing styles, clever prompt design, or text paraphrasing.
We also extend the impossibility result to include the case where pseudorandom
number generators are used for AI-text generation instead of true randomness.
We show that the same result holds with a negligible correction term for all
polynomial-time computable detectors. Finally, we show that even LLMs protected
by watermarking schemes can be vulnerable against spoofing attacks where
adversarial humans can infer hidden LLM text signatures and add them to
human-generated text to be detected as text generated by the LLMs, potentially
causing reputational damage to their developers. We believe these results can
open an honest conversation in the community regarding the ethical and reliable
use of AI-generated text.
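Impossibility results of this kind are typically stated in terms of the total variation distance between the model's text distribution and the human text distribution; the bound below is a hedged sketch of the form such results take, with the symbols M, H, and the exact expression assumed here rather than quoted from the paper's theorem.

```latex
% Sketch: for any detector D distinguishing model text (distribution M)
% from human text (distribution H), detection performance is bounded by
% the total variation distance between the two (form assumed, not quoted):
\mathrm{AUROC}(D) \;\le\; \frac{1}{2}
    + \mathrm{TV}(\mathcal{M},\mathcal{H})
    - \frac{\mathrm{TV}(\mathcal{M},\mathcal{H})^{2}}{2}
```

As the language model improves at emulating human text, TV(M, H) tends to 0 and the bound collapses to 1/2, matching the abstract's claim that even the best-possible detector performs only marginally better than a random classifier.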